On the Statistical Consistency of DOP Estimators

Authors

  • Detlef Prescher
  • Remko Scha
  • Khalil Sima'an
  • Andreas Zollmann
Abstract

A statistical estimator attempts to guess an unknown probability distribution by analyzing a sample from this distribution. One desirable property of an estimator is that its guess is increasingly likely to get arbitrarily close to the actual distribution as the sample size increases. This property is called consistency. Data Oriented Parsing (DOP) employs all fragments of the trees in a training treebank, including the full parse-trees themselves, as the rewrite rules of a probabilistic tree-substitution grammar. Since the most popular DOP-estimator (DOP1) was shown to be inconsistent, there is an outstanding theoretical question concerning the possibility of DOP-estimators with reasonable statistical properties. This question constitutes the topic of the current paper. First, we show that, contrary to common wisdom, any unbiased estimator for DOP is futile because it will not generalize over the training treebank. Subsequently, we show that a consistent estimator that generalizes over the treebank should involve a local smoothing technique. This exposes the relation between DOP and existing memory-based models that work with full memory and an analogical function such as k-nearest neighbor, which is known to implement backoff smoothing. Finally, we present a new consistent backoff-based estimator for DOP and discuss how it combines the memory-based preference for the longest match with the probabilistic preference for the most frequent match.
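
The notion of consistency described in the abstract can be illustrated with a small simulation. The sketch below is not a DOP estimator and does not come from the paper: it uses a plain relative-frequency estimator over a hypothetical three-outcome distribution (the names `true_dist`, `t1`, `t2`, `t3` and the choice of total variation distance as the error measure are all assumptions made for illustration). It only shows what "the guess gets closer to the actual distribution as the sample grows" looks like in practice.

```python
import random
from collections import Counter


def relative_frequency(sample):
    """Estimate a distribution by the relative frequency of each outcome."""
    counts = Counter(sample)
    n = len(sample)
    return {outcome: c / n for outcome, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def estimation_error(true_dist, n, rng):
    """Draw n samples from true_dist and return the estimator's error."""
    outcomes = list(true_dist)
    weights = [true_dist[o] for o in outcomes]
    sample = rng.choices(outcomes, weights=weights, k=n)
    return total_variation(true_dist, relative_frequency(sample))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    # Hypothetical "true" distribution over three outcomes (an assumption).
    true_dist = {"t1": 0.5, "t2": 0.3, "t3": 0.2}
    for n in (10, 100, 1000, 10000, 100000):
        print(n, round(estimation_error(true_dist, n, rng), 4))
```

For a consistent estimator such as this one, the printed error tends to shrink as `n` grows, although any single run can fluctuate. The abstract's point is that DOP1 lacks this property, which a toy example over a flat outcome space cannot reproduce.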

Similar resources

A Consistent and Efficient Estimator for the Data-oriented Parsing Model

Given a sequence of samples from an unknown probability distribution, a statistical estimator aims at providing an approximate guess of the distribution by utilizing statistics from the samples. One desired property of an estimator is that its guess approaches the unknown distribution as the sample sequence grows large. Mathematically speaking, this property is called consistency. This thesis p...

A Consistent and Efficient Estimator for Data-Oriented Parsing

Given a sequence of samples from an unknown probability distribution, a statistical estimator aims at providing an approximate guess of the distribution by utilizing statistics from the samples. One crucial property of a ‘good’ estimator is that its guess approaches the unknown distribution as the sample sequence grows large. This property is called consistency. This paper concerns estimators f...

On the Estimation of Shannon Entropy

Shannon entropy is increasingly used in many applications. In this article, an estimator of the entropy of a continuous random variable is proposed. Consistency and scale invariance of variance and mean squared error of the proposed estimator is proved and then comparisons are made with Vasicek's (1976), van Es (1992), Ebrahimi et al. (1994) and Correa (1995) entropy estimators. A simulation st...

Asymptotic Behaviors of Nearest Neighbor Kernel Density Estimator in Left-truncated Data

Kernel density estimators are the basic tools for density estimation in non-parametric statistics. The k-nearest neighbor kernel estimators represent a special form of kernel density estimators, in which the bandwidth is varied depending on the location of the sample points. In this paper, we initially introduce the k-nearest neighbor kernel density estimator in the random left-truncatio...

Back-off as Parameter Estimation for DOP models

Data-Oriented Parsing (DOP) is a probabilistic performance approach to parsing natural language. Several DOP models have been proposed since it was introduced by Scha (1990), achieving promising results. One important feature of these models is the probability estimation procedure. Two major estimators have been put forward: Bod (1993) uses a relative frequency estimator; Bonnema (1999) adds a ...

Publication date: 2003